Accessing UK Air Quality and Meteorological Data with {openair} and {worldmet}
1 Introduction
Often the greatest obstacle to effective data analysis is accessing the data itself. We'll frequently want to synthesise multiple data sets from different air quality monitoring or meteorological measurement stations, which may be easier said than done! Consider the potential steps to do so:
First, we need to find about the measurement site(s) available to us — where are they? How many are in the area we’re interested in? What kind of sites are they (e.g., roadside, urban, rural)?
We are likely interested in many sites, so we’d have to seek out multiple sets of data. Where are these data all stored? What format are they stored in? Are they consistently formatted?
We are likely also interested in multiple years of data, so once again we’d have to download each year separately. Did the formatting change year-on-year? Did the site start/stop running in the time period we’re interested in?
We’d need to bind these data together, which may be difficult if columns were inconsistently ordered or named.
We’d then need to find a nearby met site and time-align its measurements with our air quality data. Where is the nearest met station? How does it store its data?
Finally, if we want to expand our analysis to other sites or time periods, we’d need to start this whole process again!
Thankfully, the extended openair family – {openair}, {worldmet} and {openairmaps} – streamlines this process significantly:

{openair} exports functions like importAURN(), which allow you to import air quality measurements and statistics in a consistent way.

{worldmet} is a smaller package which contains tools to identify met stations and import met data.

{openairmaps} can visualise AQ monitoring networks, which allows you to discover measurement sites in a more visual, interactive way.
Ensure you have these three packages installed and loaded before you continue. Also install and load the tidyverse packages dplyr and stringr, which will be used later to assist with the data importing process.
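For instance, a set-up chunk along the following lines will do the job (install.packages() is only needed the first time):

```r
# install from CRAN (only needed once)
install.packages(c("openair", "worldmet", "openairmaps", "dplyr", "stringr"))

# load for the current session
library(openair)
library(worldmet)
library(openairmaps)
library(dplyr)
library(stringr)
```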
2 Accessing Air Quality Data
2.1 Obtaining Site Metadata
The first function you will likely use is importMeta(), which returns site metadata. Supported networks include the UK Automatic Urban and Rural Network (AURN), the devolved networks (Air Quality England and the Scottish, Welsh and Northern Ireland networks), and networks managed by English local authorities. By default, basic site information is returned — the site name, code and type, and its latitude and longitude.
importMeta(source = "aurn")

For more complete metadata, the all argument can be set to TRUE. This additionally returns information related to the individual pollutants measured at each site and when they started (and stopped, if appropriate) being measured. When appropriate, you'll also receive more context related to the site (zone, agglomeration, local authority, etc.).
importMeta(source = "aurn", all = TRUE)

The key piece of information you'll want to take from importMeta() is the site code(s) of interest. These can be used in {openair}'s other "import" functions. To begin with, this can be done manually. One way is to View() the data frame in RStudio and copy out the sites of interest.
site_meta <- importMeta(source = "aurn")
View(site_meta)

Alternatively, you could use openairmaps::networkMap() to view the sites in their geographical context. By clicking on the different markers you will be shown all of the information presented by importMeta(all = TRUE) in a popup.
networkMap(source = "aurn")

Figure 1: networkMap() of the AURN plotted using {openairmaps}.

One of the benefits of using networkMap() is that multiple networks can be visualised at once. This can be useful in a few different ways, but most commonly when you are interested in air quality data from a particular geographic region but aren't overly concerned about its specific origin. For example, Figure 2 shows the AURN, AQE network and English locally managed networks on one map.
networkMap(source = c("aurn", "aqe", "local"))

Figure 2: networkMap() of the AURN, AQE network and locally managed networks.

There are different networks to consider and, usefully, with the date option you can see what sites were available at any point in history. Try experimenting with the networkMap() function and see if you can look back at what the network(s) looked like in 1990…
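As a starting point, something like the following should map the network as it stood decades ago (the exact "YYYY-MM-DD" format accepted by the date argument is an assumption here - check ?networkMap):

```r
# map the AURN as it stood at the start of 1990
# (the "YYYY-MM-DD" date format is an assumption - see ?networkMap)
networkMap(source = "aurn", date = "1990-01-01")
```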
Once you have the site codes of interest, you can move onto importing your data of choice.
2.2 Importing Air Quality Data
Armed with the site codes of interest, you can now use the appropriate "import" function to access your data of choice. As we have been looking at the AURN so far, the relevant function for us is importAURN(). The important arguments to provide are:
site: the site code(s) of interest, obtained from importMeta().

year: the year(s) of interest. You can provide:

- a single year (e.g., 2020)
- multiple individual years (e.g., c(2010, 2015, 2020))
- a range of years (e.g., 2015:2020)
For example, the below code chunk imports air quality data for the two measurement sites in York for the years 2015 to 2020.
importAURN(
site = c("YK10", "YK11"),
year = 2015:2020
)

# A tibble: 105,216 × 15
site code date nox no2 no pm10 pm2.5 v10 v2.5
<chr> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 York Boo… YK10 2015-01-01 00:00:00 NA NA NA 29.7 26.6 8.1 6.7
2 York Boo… YK10 2015-01-01 01:00:00 NA NA NA 25.9 15.4 4.9 2.8
3 York Boo… YK10 2015-01-01 02:00:00 NA NA NA 17.4 12.7 9 4.5
4 York Boo… YK10 2015-01-01 03:00:00 NA NA NA 16.9 14 4.7 6.7
5 York Boo… YK10 2015-01-01 04:00:00 NA NA NA 15.7 16.7 3.8 4.2
6 York Boo… YK10 2015-01-01 05:00:00 NA NA NA 11.5 6.7 2.7 0.4
7 York Boo… YK10 2015-01-01 06:00:00 NA NA NA 15.4 10.6 5.1 4.4
8 York Boo… YK10 2015-01-01 07:00:00 NA NA NA 14.4 14 4.4 5
9 York Boo… YK10 2015-01-01 08:00:00 NA NA NA 13 12.1 3.2 4.9
10 York Boo… YK10 2015-01-01 09:00:00 NA NA NA 16.6 11.3 5 2.8
# … with 105,206 more rows, and 5 more variables: nv10 <dbl>, nv2.5 <dbl>,
# ws <dbl>, wd <dbl>, air_temp <dbl>
importAURN() and other similar functions can also be used in a more advanced way. Other arguments include:
pollutant: the pollutant(s) of interest. You can provide:

- a single pollutant (e.g., "nox")
- multiple pollutants (e.g., c("nox", "no2"))
- nothing at all! (this imports all pollutants measured by the site(s))

meta: when TRUE this will also return the site latitude, longitude and type.

to_narrow: when TRUE this will change the structure of the resulting data frame to a "narrow" or "tidy" format. Much has been written about "tidy" data [@WickhamTidy2014], but this argument effectively stacks all of the pollutant columns on top of one another.

hc: when TRUE any hydrocarbon data is also imported. This is FALSE by default as relatively few sites measure hydrocarbons, most users won't be interested in the data, and the resulting data frame is considerably larger with hydrocarbons included.
The below code chunk once again imports York air quality data from 2015 to 2020, but this time only NO2, NOx and PM10 data in a narrow/tidy format, appended with site metadata.
importAURN(
site = c("YK10", "YK11"),
pollutant = c("no2", "nox", "pm10"),
year = 2015:2020,
meta = TRUE,
to_narrow = TRUE
)

# A tibble: 315,648 × 8
date site code latitude longi…¹ site_…² pollu…³ value
<dttm> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 2015-01-01 00:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
2 2015-01-01 01:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
3 2015-01-01 02:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
4 2015-01-01 03:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
5 2015-01-01 04:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
6 2015-01-01 05:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
7 2015-01-01 06:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
8 2015-01-01 07:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
9 2015-01-01 08:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
10 2015-01-01 09:00:00 York Bootham YK10 54.0 -1.09 Urban … no2 NA
# … with 315,638 more rows, and abbreviated variable names ¹longitude,
# ²site_type, ³pollutant
One last argument which hasn’t been discussed is data_type, which controls the type of data returned by the import function. By default, this is “hourly” averaged data, but there are a few others to consider:
"daily", "monthly" and "annual" are all alternative averaging periods, with the latter two also returning data capture information.

"15_min" pertains to SO2, and returns 15-minute average concentrations.

"8_hour" and "daily_max_8" both pertain to CO and O3, and return rolling 8-hour means and maxima respectively.

"daqi" returns Defra's Daily Air Quality Index (DAQI). More information about this statistic is available here.
Consider the structure of the data returned when data_type = "annual". There is a column for each pollutant, alongside a separate *_capture column which gives the annual data capture as a fraction (0–1). data_type = "annual" pairs well with to_narrow = TRUE, which instead returns three columns: "species" for the pollutant name, "value" for the annual mean, and "data_capture" for the fractional data capture.
importAURN(
data_type = "annual",
year = 2020
)

# A tibble: 172 × 38
uka_c…¹ code site date o3 o3_ca…² o3.su…³ o3.da…⁴ o3.ao…⁵
<fct> <chr> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <int>
1 UKA003… ABD Aber… 2020-01-01 00:00:00 45.5 0.610 0.718 57.5 NA
2 UKA005… ABD7 Aber… 2020-01-01 00:00:00 NA NA NA NA NA
3 UKA006… ABD8 Aber… 2020-01-01 00:00:00 NA NA NA NA NA
4 UKA004… ACTH Auch… 2020-01-01 00:00:00 58.9 0.992 0.993 67.2 2038
5 UKA005… AGRN Birm… 2020-01-01 00:00:00 45.6 0.586 0.345 57.3 NA
6 UKA001… AH Asto… 2020-01-01 00:00:00 66.2 0.988 0.983 74.9 5717
7 UKA005… ARM6 Arma… 2020-01-01 00:00:00 NA NA NA NA NA
8 UKA006… BAAR Ball… 2020-01-01 00:00:00 NA NA NA NA NA
9 UKA005… BALM Ball… 2020-01-01 00:00:00 NA NA NA NA NA
10 UKA003… BAR3 Barn… 2020-01-01 00:00:00 46.0 0.858 0.759 57.2 NA
# … with 162 more rows, 29 more variables: o3.aot40f <int>, somo35 <dbl>,
# somo35_capture <dbl>, no <dbl>, no_capture <dbl>, no2 <dbl>,
# no2_capture <dbl>, nox <dbl>, nox_capture <dbl>, so2 <dbl>,
# so2_capture <dbl>, co <dbl>, co_capture <dbl>, pm10 <dbl>,
# pm10_capture <dbl>, nv10 <dbl>, nv10_capture <dbl>, v10 <dbl>,
# v10_capture <dbl>, pm2.5 <dbl>, pm2.5_capture <dbl>, nv2.5 <dbl>,
# nv2.5_capture <dbl>, v2.5 <dbl>, v2.5_capture <dbl>, gr10 <dbl>, …
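For example, pairing the two arguments as described above might look like the chunk below (output omitted; like all of the "import" functions, this downloads data over the network):

```r
# annual statistics for 2020 in a narrow/tidy format:
# one row per site and pollutant, with a data_capture
# column alongside the annual mean
importAURN(
  data_type = "annual",
  year = 2020,
  to_narrow = TRUE
)
```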
We’ll be accessing UK data over the next couple of days, so it is good to get a feel for the different options and how they affect what is imported and how it is formatted.
2.3 Automation with dplyr
When we write code, our goal should be to limit the number of “human steps” needed. There are lots of reasons to do so, but just in the realm of picking out sites:
Humans are fallible and end up doing things like misspelling site names, or missing relevant sites when manually searching a network.
Humans aren’t as quick as computers at doing pattern recognition.
If something needs to change, having human steps makes that change more painful. For example, a shift from considering background sites to traffic sites is a one word change in the code, but perhaps a ten minute job by a human!
We can think of the “hunt through the data frame to find the sites you are interested in” as a gap in our script, and we need to close this gap. We need a programmatic way of getting from having an importMeta() output to having a list of sites to provide to importAURN(). This is what we need dplyr (and a bit of stringr) for!
The dplyr function we're going to use is filter(), which uses logical operators to subset the rows of a data frame. That is quite an abstract description, so you can also think of it as a way to pull certain rows out of your data based on features of those rows! Let's try this a few different ways.
First, let's get all of the urban background sites. This uses the equality operator, ==. The expression below can be read "in the site_meta data frame, filter for rows where site_type is equal to 'Urban Background'". This is case sensitive, so make sure you have your capitalisation correct!
site_meta <- importMeta()
filter(site_meta, site_type == "Urban Background")

# A tibble: 127 × 5
site code latitude longitude site_type
<chr> <chr> <dbl> <dbl> <chr>
1 Aberdeen ABD 57.2 -2.09 Urban Background
2 Aberdeen Erroll Park ABD9 57.2 -2.09 Urban Background
3 Ballymena Ballykeel BALM 54.9 -6.25 Urban Background
4 Barnsley BARN 53.6 -1.48 Urban Background
5 Barnsley 12 BAR2 53.6 -1.49 Urban Background
6 Barnsley Gawber BAR3 53.6 -1.51 Urban Background
7 Belfast Centre BEL2 54.6 -5.93 Urban Background
8 Belfast East BEL 54.6 -5.90 Urban Background
9 Belfast South BEL3 54.6 -5.91 Urban Background
10 Bircotes BIR 53.4 -1.05 Urban Background
# … with 117 more rows
Perhaps we are interested in the AURN sites that measure CO and SO2. filter() can contain multiple logical expressions, so here we are filtering for the variable column to be either ‘CO’ or ‘SO2’ (using the %in% operator) and the end_date to be ‘ongoing’.
site_meta_all <- importMeta(all = TRUE)
filter(site_meta_all, variable %in% c("CO", "SO2"), end_date == "ongoing")

# A tibble: 35 × 13
code site site_…¹ latit…² longi…³ varia…⁴ Param…⁵ start_date
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dttm>
1 BALM Ballymena … Urban … 54.9 -6.25 SO2 Sulphu… 2010-01-01 00:00:00
2 BAR3 Barnsley G… Urban … 53.6 -1.51 SO2 Sulphu… 1997-07-07 00:00:00
3 BEL2 Belfast Ce… Urban … 54.6 -5.93 SO2 Sulphu… 1992-03-08 00:00:00
4 BEL2 Belfast Ce… Urban … 54.6 -5.93 CO Carbon… 1992-03-08 00:00:00
5 BMLD Birmingham… Urban … 52.5 -1.92 SO2 Sulphu… 2020-01-01 00:00:00
6 CARD Cardiff Ce… Urban … 51.5 -3.18 SO2 Sulphu… 1992-05-12 00:00:00
7 CARD Cardiff Ce… Urban … 51.5 -3.18 CO Carbon… 1992-05-12 00:00:00
8 CHBO Chilbolton… Rural … 51.1 -1.44 SO2 Sulphu… 2016-01-11 00:00:00
9 DERR Derry Rose… Urban … 55.0 -7.33 SO2 Sulphu… 2016-03-21 00:00:00
10 ED3 Edinburgh … Urban … 55.9 -3.18 SO2 Sulphu… 2003-11-24 00:00:00
# … with 25 more rows, 5 more variables: end_date <chr>, ratified_to <dttm>,
# zone <chr>, agglomeration <chr>, local_authority <chr>, and abbreviated
# variable names ¹site_type, ²latitude, ³longitude, ⁴variable,
# ⁵Parameter_name
Earlier we also loaded stringr, which is a package that streamlines working with character data. The function that is relevant for this work is str_detect(), which can be used in filter(). The below code can be read “in the site_meta data frame, filter for rows where the string ‘Urban’ is present in site_type”. This effectively returns all of the Urban Traffic and Urban Background sites!
filter(site_meta, str_detect(site_type, "Urban"))

# A tibble: 235 × 5
site code latitude longitude site_type
<chr> <chr> <dbl> <dbl> <chr>
1 Aberdeen ABD 57.2 -2.09 Urban Background
2 Aberdeen Erroll Park ABD9 57.2 -2.09 Urban Background
3 Aberdeen Union Street Roadside ABD7 57.1 -2.11 Urban Traffic
4 Aberdeen Wellington Road ABD8 57.1 -2.09 Urban Traffic
5 Armagh Roadside ARM6 54.4 -6.65 Urban Traffic
6 Ballymena Antrim Road BAAR 54.9 -6.27 Urban Traffic
7 Ballymena Ballykeel BALM 54.9 -6.25 Urban Background
8 Barnsley BARN 53.6 -1.48 Urban Background
9 Barnsley 12 BAR2 53.6 -1.49 Urban Background
10 Barnsley Gawber BAR3 53.6 -1.51 Urban Background
# … with 225 more rows
If we want to programmatically extract the York sites, we could go about it in a couple of different ways:

- We could use filter(site_meta, str_detect(site, "York")) to extract all sites with "York" in the name.
- We could use filter(site_meta_all, local_authority == "York") to filter for the local authority being 'York'.
In R, there are often a few different ways to do the same thing!
To pull a lot of this together, the example below shows some code to import ozone and NO2 data for every site in the South East that is currently measuring it.
# get meta data
site_meta_all <- importMeta(all = TRUE)
# filter meta data
se_o3_no2_meta <-
filter(site_meta_all,
zone == "South East",
variable %in% c("O3", "NO2"),
end_date == "ongoing")
# import (unique() removes duplicate site codes)
southeast_aq <-
importAURN(
site = unique(se_o3_no2_meta$code),
year = 2021,
pollutant = c("o3","no2")
)
# see summary
summary(southeast_aq)

      date                          o3               no2
Min. :2021-01-01 00:00:00 Min. : -0.60 Min. : -1.324
1st Qu.:2021-04-02 05:45:00 1st Qu.: 35.17 1st Qu.: 6.799
Median :2021-07-02 11:30:00 Median : 52.14 Median : 13.283
Mean :2021-07-02 11:30:00 Mean : 50.78 Mean : 17.782
3rd Qu.:2021-10-01 17:15:00 3rd Qu.: 67.01 3rd Qu.: 24.652
Max. :2021-12-31 23:00:00 Max. :215.69 Max. :162.889
NA's :95200 NA's :15449
site code
Length:157680 Length:157680
Class :character Class :character
Mode :character Mode :character
Some of the ideas above will take a bit of getting used to — but they are very powerful. Even with a little bit of knowledge about how to filter data, it will be surprising just how much you can do. It is worth taking a bit of time to filter these data sets in different ways. We’ll build on these skills later in the course…
3 Accessing Meteorological Data
3.1 Importing Global Met Data
The {worldmet} package effectively has two main functions for accessing meteorological data. First we'll consider getMeta(), which lists met sites around the globe.
getMeta()

# A tibble: 13,404 × 13
usaf wban station ctry st call latit…¹ longi…² elev(…³ begin
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <date>
1 010060 99999 EDGEOYA NO <NA> <NA> 78.2 22.8 14 1973-01-01
2 010070 99999 NY-ALESUND SV <NA> <NA> 78.9 11.9 7.7 1973-01-06
3 010080 99999 LONGYEAR SV <NA> ENSB 78.2 15.5 26.8 1975-09-29
4 010090 99999 KARL XII O… SV <NA> <NA> 80.6 25 5 1955-01-01
5 010100 99999 ANDOYA NO <NA> ENAN 69.3 16.1 13.1 1931-01-03
6 010110 99999 KVITOYA SV <NA> <NA> 80.1 31.5 10 1986-11-18
7 010150 99999 HEKKINGEN … NO <NA> <NA> 69.6 17.8 14 1980-03-14
8 010160 99999 KONGSOYA NO <NA> <NA> 78.9 28.9 20 1993-05-01
9 010170 99999 AKSELOYA SV <NA> <NA> 77.7 14.8 6 1973-01-01
10 010200 99999 SORKAPPOYA SV <NA> <NA> 76.5 16.6 10 2010-10-08
# … with 13,394 more rows, 3 more variables: end <date>, code <chr>,
# dist <lgl>, and abbreviated variable names ¹latitude, ²longitude,
# ³`elev(m)`
Much like openairmaps::networkMap(), worldmet::getMeta() can also return an interactive map:
getMeta(returnMap = TRUE)

Figure 3: The interactive map returned by worldmet::getMeta().

Often, you'll not be sure which meteorological station is the "local" site to your air quality measurement site. To find your local met station, you can provide the lat and lon arguments to getMeta(), which will then return the n closest stations (defaulting to 10). The code chunk below shows the 5 closest met stations to the Marylebone Road air quality site.
site_met <- importMeta()
my1 <- filter(site_met, code == "MY1")
getMeta(lat = my1$latitude, lon = my1$longitude, n = 5)

# A tibble: 5 × 15
usaf wban station ctry st call latit…¹ longi…² elev(…³ begin
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <date>
1 037700 99999 ST JAMES PA… UK <NA> <NA> 51.5 -0.117 5 2009-12-18
2 037683 99999 CITY UK <NA> EGLC 51.5 0.055 5.8 1988-01-29
3 036720 99999 NORTHOLT UK <NA> EGWU 51.6 -0.418 37.8 1973-01-01
4 037720 99999 HEATHROW UK <NA> EGLL 51.5 -0.461 25.3 1948-12-01
5 037663 99999 BIGGIN HILL UK <NA> EGKB 51.3 0.033 182. 1988-01-05
# … with 5 more variables: end <date>, code <chr>, longr <dbl>, latr <dbl>,
# dist <dbl>, and abbreviated variable names ¹latitude, ²longitude,
# ³`elev(m)`
When you use lat and lon and returnMap = TRUE you’ll notice that your AQ site is labelled using a red dot!
getMeta(lat = my1$latitude, lon = my1$longitude, n = 5, returnMap = TRUE)

Once you've decided on your met station, simply combine its usaf and wban numbers (e.g., "037720-99999") and pass them to importNOAA() to access the met data. You'll notice some similar arguments to importAURN(). The important one to concern yourself with is year, where you define the year(s) of data you are interested in.
worldmet::importNOAA(code = "037720-99999", year = 2020)

# A tibble: 8,784 × 25
code station date latit…¹ longi…² elev ws wd air_t…³
<fct> <fct> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 037720… HEATHR… 2020-01-01 00:00:00 51.5 -0.461 25.3 1.53 123. 3.4
2 037720… HEATHR… 2020-01-01 01:00:00 51.5 -0.461 25.3 2.07 118. 3.6
3 037720… HEATHR… 2020-01-01 02:00:00 51.5 -0.461 25.3 2.27 101. 4.4
4 037720… HEATHR… 2020-01-01 03:00:00 51.5 -0.461 25.3 2.6 100. 4.23
5 037720… HEATHR… 2020-01-01 04:00:00 51.5 -0.461 25.3 2.9 90.8 4.2
6 037720… HEATHR… 2020-01-01 05:00:00 51.5 -0.461 25.3 3.27 87.4 4.47
7 037720… HEATHR… 2020-01-01 06:00:00 51.5 -0.461 25.3 1.5 85 4.87
8 037720… HEATHR… 2020-01-01 07:00:00 51.5 -0.461 25.3 1 196. 4.73
9 037720… HEATHR… 2020-01-01 08:00:00 51.5 -0.461 25.3 0.833 96.9 3.7
10 037720… HEATHR… 2020-01-01 09:00:00 51.5 -0.461 25.3 1.33 58.8 4.9
# … with 8,774 more rows, 16 more variables: atmos_pres <dbl>,
# visibility <dbl>, dew_point <dbl>, RH <dbl>, ceil_hgt <dbl>, cl_1 <dbl>,
# cl_2 <dbl>, cl_3 <dbl>, cl <dbl>, cl_1_height <dbl>, cl_2_height <dbl>,
# cl_3_height <dbl>, precip_12 <dbl>, precip_6 <dbl>, pwc <chr>,
# precip <dbl>, and abbreviated variable names ¹latitude, ²longitude,
# ³air_temp
3.2 Combining Air Quality and Meteorological Data
We are now in the situation where we have multiple data frames we want to combine - one air quality data set, and one meteorological one. “Joining” can be complicated, so let’s create some example data to demonstrate:
Have a look at the data in Table 1. Notice that they share a column - “country” - though the items in it are not consistent (the UK and France are in both, but one has the USA and the other Germany).
Table 1: The two dataframes. Notice the differences in their shared columns.
| country | capital |
|---|---|
| UK | London |
| USA | Washington DC |
| France | Paris |
| country | language |
|---|---|
| UK | English |
| France | French |
| Germany | German |
The tidyverse package dplyr contains “joining” functions. To familiarise yourself with them, read the “cheat sheet” available here. There are several different joining functions - notice the differences between them below in Table 2.
Table 2: The four different kinds of ‘mutating’ joins and the functions used to achieve them.
| country | capital | language |
|---|---|---|
| UK | London | English |
| USA | Washington DC | NA |
| France | Paris | French |
| country | capital | language |
|---|---|---|
| UK | London | English |
| France | Paris | French |
| Germany | NA | German |
| country | capital | language |
|---|---|---|
| UK | London | English |
| France | Paris | French |
| country | capital | language |
|---|---|---|
| UK | London | English |
| USA | Washington DC | NA |
| France | Paris | French |
| Germany | NA | German |
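To make Table 2 concrete, the code below rebuilds the example data (the names countries and languages are invented for this sketch) and applies each of the four joins:

```r
library(dplyr)

# toy data frames mirroring Table 1
countries <- tibble(
  country = c("UK", "USA", "France"),
  capital = c("London", "Washington DC", "Paris")
)
languages <- tibble(
  country = c("UK", "France", "Germany"),
  language = c("English", "French", "German")
)

# the four 'mutating' joins shown in Table 2
left_join(countries, languages, by = "country")   # all rows of 'countries'
right_join(countries, languages, by = "country")  # all rows of 'languages'
inner_join(countries, languages, by = "country")  # only countries in both
full_join(countries, languages, by = "country")   # all rows of either
```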
Nine times out of ten, left_join() will do the job, so that's what we're going to use for our data. The common column between our two data frames is "date", so that is what we will join by. You don't have to explicitly tell a "join" function which column to join by - it will work it out from shared column names - but it is good practice to be explicit.
To avoid data loss, it is a good idea for the first data frame listed to have the finest time resolution - join a daily data frame with a monthly one rather than the other way around. In this case, both are hourly, so it doesn't really matter which goes first.
The below chunk imports some air quality data for Marylebone road, some met data from Heathrow Airport, and then joins them together. Note that we use the select() function from dplyr to drop some of the met data; often we’re only interested in wind speed and direction.
aq <- importAURN(site = "my1", pollutant = c("no2", "nox", "pm2.5", "pm10"), year = 2020)
met <- worldmet::importNOAA(code = "037720-99999", year = 2020)
all <- left_join(select(met, date, ws, wd), aq, by = "date")
all

# A tibble: 8,784 × 9
date ws wd no2 nox pm2.5 pm10 site code
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 2020-01-01 00:00:00 1.53 123. 45.8 166. 58.1 69 London Maryleb… MY1
2 2020-01-01 01:00:00 2.07 118. 52.6 189. 43.2 45 London Maryleb… MY1
3 2020-01-01 02:00:00 2.27 101. 44.8 152. 43 46.2 London Maryleb… MY1
4 2020-01-01 03:00:00 2.6 100. 40.2 144. 42.8 45.1 London Maryleb… MY1
5 2020-01-01 04:00:00 2.9 90.8 47.3 158. 36.8 40.8 London Maryleb… MY1
6 2020-01-01 05:00:00 3.27 87.4 40.4 136. 39.2 44.1 London Maryleb… MY1
7 2020-01-01 06:00:00 1.5 85 42.7 148. 38.7 44.3 London Maryleb… MY1
8 2020-01-01 07:00:00 1 196. 40.7 181. 39.3 40.4 London Maryleb… MY1
9 2020-01-01 08:00:00 0.833 96.9 42.6 147. 37.3 37.5 London Maryleb… MY1
10 2020-01-01 09:00:00 1.33 58.8 38.3 139. 34.3 34.6 London Maryleb… MY1
# … with 8,774 more rows
Notice what has happened. As we joined by the “date” column, that is at the far left. Then the met columns are next, as we provided that data frame first. Finally, the AQ data is at the right, as it was provided second.
If a row in the first data frame matches multiple rows in the second, it will be duplicated in the output. This might mean that your resulting data frame is longer than what you put in to begin with - something to be avoided. It is therefore a good idea to check that you've not gained any rows using a function like nrow().
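A minimal check of this sort, reusing the met and all data frames from the chunk above, could be:

```r
# the join should not have duplicated any rows; if it has,
# stopifnot() throws an error rather than failing silently
stopifnot(nrow(all) == nrow(met))
```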